Global Retail Sales Lakehouse

1. Executive Summary

The Global Retail Sales Lakehouse project delivers an end-to-end data platform on Azure built on the Medallion Architecture (Bronze, Silver, Gold). It ingests multi-region sales data from more than 50 global locations, processes it with Azure Data Factory, Azure Databricks, and Delta Lake, and serves curated analytics through Azure Synapse and Power BI. The solution supports both batch and near-real-time processing, targets 99.9% data accuracy at the Gold layer, and gives the business a unified global view with self-service analytics.

2. Architecture Overview

The architecture follows a streamlined flow: ADF pipelines ingest data into ADLS Gen2, raw files land in the Bronze layer, the Silver layer standardizes and cleanses them with SCD-2 logic, and the Gold layer aggregates the results for business reporting. Azure Synapse (formerly Azure SQL Data Warehouse) and Power BI consume the curated Gold data for dashboards and insights. This design supports scalability, governance, and cost efficiency across the entire pipeline.
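
To make the flow concrete, a minimal sketch of the layer path convention it implies is shown below; the storage account, container, and variable names are illustrative assumptions, not the project's actual values.

    # Hypothetical ADLS Gen2 path convention for the Medallion layers;
    # the storage account and container names are placeholders.
    LAKE_ROOT = "abfss://lake@retailsaleslake.dfs.core.windows.net"

    BRONZE_PATH = f"{LAKE_ROOT}/bronze"   # raw files as landed by ADF
    SILVER_PATH = f"{LAKE_ROOT}/silver"   # cleansed, conformed Delta tables
    GOLD_PATH   = f"{LAKE_ROOT}/gold"     # curated aggregates for Synapse / Power BI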

3. Technology Stack

  • Ingestion: Azure Data Factory (ADF)
  • Storage: ADLS Gen2, Delta Lake
  • Processing: Azure Databricks (PySpark)
  • Warehousing: Azure Synapse Analytics (formerly Azure SQL DW)
  • Analytics: Power BI
  • Governance: Azure Purview, Unity Catalog
  • Monitoring: Azure Monitor, Log Analytics

4. Medallion Data Model

Bronze Layer: Stores all raw incoming CSV/Parquet data in its original format.

Silver Layer: Standardizes, cleanses, and builds SCD-2 dimensions and conformed facts for downstream reliability.
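
As one illustration, a Silver SCD-2 store dimension might carry history columns like the sketch below, executed from a Databricks notebook where spark is predefined; the table and column names are assumptions, and surrogate-key handling is elided.

    # Hypothetical SCD-2 store dimension in the Silver layer (Delta).
    # effective_from / effective_to / is_current carry the type-2 history.
    spark.sql("""
        CREATE TABLE IF NOT EXISTS silver.dim_store (
            store_id        STRING,      -- natural key from the source system
            store_name      STRING,
            region          STRING,
            effective_from  TIMESTAMP,   -- when this version became active
            effective_to    TIMESTAMP,   -- NULL while the version is current
            is_current      BOOLEAN
        ) USING DELTA
    """)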

Gold Layer: Produces curated, aggregated tables such as store-level daily sales and product hierarchy views, optimized for fast reporting and near-real-time dashboards.

5. ETL Processing

Daily Ingestion: ADF pipelines capture new files from REST and SFTP sources and land them in Bronze storage, validating schema and row counts on arrival.
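
The landing-time checks might resemble the Databricks-side sketch below; the expected column list, the manifest-supplied row count, and the function name are assumptions.

    # Hypothetical post-landing check run in Databricks
    # (spark is the notebook's predefined SparkSession).
    EXPECTED_COLUMNS = ["store_id", "product_id", "sale_date", "amount"]

    def validate_landed_file(path: str, expected_rows: int):
        """Fail fast if a landed Bronze file drifts from the agreed layout."""
        df = spark.read.csv(path, header=True)
        if df.columns != EXPECTED_COLUMNS:
            raise ValueError(f"Schema drift in {path}: found {df.columns}")
        actual = df.count()
        if actual != expected_rows:
            raise ValueError(f"Row-count mismatch in {path}: "
                             f"expected {expected_rows}, got {actual}")
        return df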

Transformation: Databricks transforms Bronze data into trustworthy Silver datasets with cleansing, SCD-2 logic, and upserts into Delta tables.
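
A common way to express the SCD-2 upsert on Delta is a two-step close-and-append, sketched below; the table reuses the hypothetical silver.dim_store from Section 4, and the tracked columns are assumptions.

    from delta.tables import DeltaTable
    from pyspark.sql import functions as F

    def scd2_upsert(updates_df):
        """Two-step SCD-2 upsert: expire changed rows, append new versions."""
        dim = DeltaTable.forName(spark, "silver.dim_store")

        # Step 1: close out current rows whose tracked attributes changed.
        (dim.alias("d")
            .merge(updates_df.alias("u"),
                   "d.store_id = u.store_id AND d.is_current = true")
            .whenMatchedUpdate(
                condition="d.store_name <> u.store_name OR d.region <> u.region",
                set={"is_current": "false",
                     "effective_to": "current_timestamp()"})
            .execute())

        # Step 2: append rows that are new or changed as the current versions.
        unchanged = (spark.table("silver.dim_store")
                          .where("is_current = true")
                          .select("store_id", "store_name", "region"))
        (updates_df.join(unchanged,
                         ["store_id", "store_name", "region"], "left_anti")
            .withColumn("effective_from", F.current_timestamp())
            .withColumn("effective_to", F.lit(None).cast("timestamp"))
            .withColumn("is_current", F.lit(True))
            .write.format("delta").mode("append")
            .saveAsTable("silver.dim_store"))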

Aggregation: PySpark jobs compute store-level metrics and load the Gold fact tables. Together, these three pipelines provide accuracy, automation, and transparency.
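
To make the aggregation step concrete, a representative Gold job might look like the sketch below; the Silver fact, its columns, and the Gold table name are assumptions, and the Gold table is assumed to be partitioned by sale_date.

    from pyspark.sql import functions as F

    def build_daily_store_sales(run_date: str):
        """Recompute one day's store-level rollup; reruns stay idempotent
        because replaceWhere swaps out only the affected date."""
        daily = (spark.table("silver.fact_sales")
                 .where(F.col("sale_date") == run_date)
                 .groupBy("store_id", "sale_date")
                 .agg(F.sum("amount").alias("total_sales"),
                      F.countDistinct("order_id").alias("order_count")))
        (daily.write.format("delta")
              .mode("overwrite")
              .option("replaceWhere", f"sale_date = '{run_date}'")
              .saveAsTable("gold.daily_store_sales"))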

6. Project Timeline (12 Weeks)

  • Week 1–2 (Foundation): Set up ADF, ADLS containers, IAM, Unity Catalog, and the base environment.
  • Week 3–4 (Bronze Layer): Onboard 10 regions, enable ingestion, schema registry, and monitoring.
  • Week 5–7 (Silver Layer): Implement SCD-2 transformations, data quality rules, and lineage tracking.
  • Week 8–9 (Gold Layer): Build aggregation jobs, load the Azure Synapse SQL pool, and tune performance.
  • Week 10–11 (Power BI & UAT): Develop dashboards, conduct user training, validate business logic.
  • Week 12 (Go-Live): Execute cutover, run final loads, switch Power BI to new sources.

7. Testing & Deployment

Testing includes unit validation, integration checks, data quality testing, and performance tuning to ensure Gold queries run in under five seconds. Deployment uses a blue-green strategy, performing a clean cutover from the legacy environment with a monitored 48-hour stabilization window. Rollback is supported by switching Power BI back to the old warehouse if needed.
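
As one example of an integration-level data-quality check, the sketch below asserts that a day's Gold rollup ties back to the Silver fact; the table names reuse the earlier hypothetical sketches, and the tolerance is an assumption.

    from pyspark.sql import functions as F

    # Hypothetical reconciliation check: Gold totals must tie back to Silver.
    def test_gold_ties_to_silver(run_date: str, tolerance: float = 0.01):
        silver_total = (spark.table("silver.fact_sales")
                        .where(f"sale_date = '{run_date}'")
                        .agg(F.sum("amount")).first()[0]) or 0.0
        gold_total = (spark.table("gold.daily_store_sales")
                      .where(f"sale_date = '{run_date}'")
                      .agg(F.sum("total_sales")).first()[0]) or 0.0
        assert abs(silver_total - gold_total) <= tolerance, (
            f"Gold/Silver mismatch on {run_date}: {gold_total} vs {silver_total}")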

8. Monitoring & Maintenance

After go-live, the platform is monitored to maintain a pipeline success rate above 99%, keep cluster utilization below 80%, ensure data freshness within four hours, and hold cloud spend under $12K per month. Alerts and dashboards support proactive issue detection and reporting.
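
A freshness probe in that spirit might look like the sketch below; the ingested_at column and the choice to fail the job (so the scheduler surfaces the failure to Azure Monitor) are assumptions.

    import datetime
    from pyspark.sql import functions as F

    FRESHNESS_SLA = datetime.timedelta(hours=4)

    def check_freshness():
        """Raise if the newest Silver row is older than the 4-hour SLA."""
        latest = spark.table("silver.fact_sales").agg(F.max("ingested_at")).first()[0]
        if latest is None:
            raise RuntimeError("silver.fact_sales is empty")
        lag = datetime.datetime.utcnow() - latest
        if lag > FRESHNESS_SLA:
            # Failing the job surfaces in Azure Monitor / Log Analytics,
            # where an alert rule can page the on-call engineer.
            raise RuntimeError(f"Silver data is stale by {lag}")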

9. Roles & Responsibilities

  • 🚀 Data Engineers: Manage pipelines and notebooks.
  • 🏗️ Data Architects: Design Medallion layers and governance.
  • 📊 BI Developers: Build Power BI models and dashboards.
  • ⚙️ DevOps: Ensure CI/CD, automation, and monitoring.
  • 📋 Project Manager: Handle delivery, risks, and communication.